Reinforcement Learning from Passive Data via Latent Intentions
Passive observational data, such as human videos, is abundant and rich in
information, yet remains largely untapped by current RL methods. Perhaps
surprisingly, we show that passive data, despite not having reward or action
labels, can still be used to learn features that accelerate downstream RL. Our
approach learns from passive data by modeling intentions: measuring how the
likelihood of future outcomes changes when the agent acts to achieve a
particular task. We propose a temporal difference learning objective to learn
about intentions, resulting in an algorithm similar to conventional RL, but
which learns entirely from passive data. When optimizing this objective, our
agent simultaneously learns representations of states, of policies, and of
possible outcomes in an environment, all from raw observational data. Both
theoretically and empirically, this scheme learns features amenable to value
prediction for downstream tasks, and our experiments demonstrate the ability to
learn from many forms of passive data, including cross-embodiment video data
and YouTube videos.
Comment: Accompanying website at https://dibyaghosh.com/icvf
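To make the idea concrete, here is a minimal PyTorch sketch (not the authors' released code) of a temporal-difference objective for an intention-conditioned value V(s, g, z) trained on action-free (s, s') transitions. The bilinear factorization phi(s)^T T(z) psi(g), the sparse indicator reward, the target network, and all module names and shapes are illustrative assumptions.

```python
# Illustrative sketch, not the authors' implementation: TD learning of an
# intention-conditioned value V(s, g, z) from passive (s, s') transitions.
import torch
import torch.nn as nn

class IntentionValue(nn.Module):
    def __init__(self, obs_dim, feat_dim=64):
        super().__init__()
        # One plausible factorization: V(s, g, z) = phi(s)^T T(z) psi(g).
        self.phi = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(),
                                 nn.Linear(256, feat_dim))      # state features
        self.psi = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(),
                                 nn.Linear(256, feat_dim))      # outcome features
        self.T = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(),
                               nn.Linear(256, feat_dim ** 2))   # intention map

    def forward(self, s, g, z):
        phi, psi = self.phi(s), self.psi(g)                     # (B, d), (B, d)
        T = self.T(z).view(-1, phi.shape[1], phi.shape[1])      # (B, d, d)
        return torch.einsum('bi,bij,bj->b', phi, T, psi)        # (B,)

def td_loss(model, target, s, s_next, g, z, gamma=0.99):
    # Sparse indicator reward for having reached the outcome g (an assumption).
    r = (torch.norm(s - g, dim=-1) < 1e-3).float()
    with torch.no_grad():
        v_next = target(s_next, g, z)  # frozen target network for bootstrapping
    return ((model(s, g, z) - (r + gamma * v_next)) ** 2).mean()
```

After pretraining on passive data, the state encoder phi could be frozen and reused as a feature map for downstream value prediction, which is the acceleration effect the abstract describes.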
HIQL: Offline Goal-Conditioned RL with Latent States as Actions
Unsupervised pre-training has recently become the bedrock for computer vision
and natural language processing. In reinforcement learning (RL),
goal-conditioned RL can potentially provide an analogous self-supervised
approach for making use of large quantities of unlabeled (reward-free) data.
However, building effective algorithms for goal-conditioned RL that can learn
directly from diverse offline data is challenging, because it is hard to
accurately estimate the value function for faraway goals. Nonetheless,
goal-reaching problems exhibit structure, such that reaching distant goals
entails first passing through closer subgoals. This structure can be very
useful, as assessing the quality of actions for nearby goals is typically
easier than for more distant goals. Based on this idea, we propose a
hierarchical algorithm for goal-conditioned RL from offline data. Using one
action-free value function, we learn two policies that allow us to exploit this
structure: a high-level policy that treats states as actions and predicts (a
latent representation of) a subgoal, and a low-level policy that predicts the
action for reaching this subgoal. Through analysis and didactic examples, we
show how this hierarchical decomposition makes our method robust to noise in
the estimated value function. We then apply our method to offline goal-reaching
benchmarks, showing that our method can solve long-horizon tasks that stymie
prior methods, can scale to high-dimensional image observations, and can
readily make use of action-free data. Our code is available at
https://seohong.me/projects/hiql
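Below is a minimal PyTorch sketch of the hierarchical decomposition described above, under several assumptions: deterministic MSE policies, a subgoal state s_k drawn k steps ahead on the same trajectory, exponentiated-advantage weighting, and a value function V(s, g) pretrained separately on action-free data. All names and hyperparameters are placeholders, not the released implementation.

```python
# Illustrative sketch, not the released HIQL code: one action-free value
# V(s, g) drives advantage-weighted training of a high-level policy over
# latent subgoals and a low-level policy over actions.
import torch
import torch.nn as nn

def mlp(inp, out):
    return nn.Sequential(nn.Linear(inp, 256), nn.ReLU(), nn.Linear(256, out))

class HIQLSketch(nn.Module):
    def __init__(self, obs_dim, act_dim, latent_dim=32):
        super().__init__()
        self.value = mlp(obs_dim * 2, 1)                 # action-free V(s, g)
        self.encode = mlp(obs_dim, latent_dim)           # latent subgoal repr.
        self.pi_hi = mlp(obs_dim * 2, latent_dim)        # (s, g) -> subgoal z
        self.pi_lo = mlp(obs_dim + latent_dim, act_dim)  # (s, z) -> action

def awr_losses(m, s, a, s_next, s_k, g, beta=1.0):
    # s_k: state k steps ahead of s on the same trajectory ("state as action").
    cat = lambda *xs: torch.cat(xs, dim=-1)
    # High-level advantage: does passing through s_k bring us closer to g?
    adv_hi = (m.value(cat(s_k, g)) - m.value(cat(s, g))).squeeze(-1).detach()
    w_hi = torch.exp(beta * adv_hi).clamp(max=100.0)
    z = m.encode(s_k).detach()  # regression target: latent subgoal repr.
    loss_hi = (w_hi * ((m.pi_hi(cat(s, g)) - z) ** 2).sum(-1)).mean()
    # Low-level advantage: does the dataset action make progress toward s_k?
    adv_lo = (m.value(cat(s_next, s_k)) - m.value(cat(s, s_k))).squeeze(-1).detach()
    w_lo = torch.exp(beta * adv_lo).clamp(max=100.0)
    loss_lo = (w_lo * ((m.pi_lo(cat(s, z)) - a) ** 2).sum(-1)).mean()
    return loss_hi, loss_lo
```

Because the low-level policy only has to judge actions against a nearby subgoal, its advantage estimates are less noisy than those computed against the faraway goal g, which is the robustness argument the abstract makes.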